1 Draft Outline: Logistic Regression in Ecology

1.1 Learning outcomes

By the end of this training, you will be able to:

  • List the main uses of logistic regression
  • Describe the assumptions of logistic regression
  • Apply logistic regression to your own data using the R programming language
  • Evaluate model fits and results

1.2 What is a logistic regression?

Logistic regression is used to model the association between a binary response variable and selected predictors. This type of analysis is common in ecological studies, where the binary response often consists of the presence (1) or absence (0) of a given species, yielding predictions of habitat suitability.

1.3 How are logistic regressions different from linear regressions?

In a linear regression, we predict quantitative outcomes from one or more predictors using a straight line. We can estimate correlations, test the regression for significance, and compare models of varying complexity, from simple models with a single predictor to models with several predictors and interactions, to find the one that best fits our data. In a logistic regression, we can do all of that too. The main difference is that the outcomes are binary rather than continuous: presence versus absence, as mentioned above, positive versus negative for a disease, or dead versus alive. These outcomes are coded as 1 (success) and 0 (failure), as in the hypothetical data below:

# Create and visualize hypothetical data
set.seed(123)
df <- data.frame(predictor = runif(100, 0, 10))
df$outcome <- rbinom(100, 1, 1 / (1 + exp(-1 * (0.5 * df$predictor - 2.5))))

knitr::kable(head(df))
| predictor | outcome |
|----------:|--------:|
|  2.875775 |       0 |
|  7.883051 |       1 |
|  4.089769 |       0 |
|  8.830174 |       0 |
|  9.404673 |       1 |
|  0.455565 |       0 |

When we plot this type of binary data, we see that observations cluster around either of the two outcome possibilities.

This type of data is better fit by an s-shaped curve than by a straight line, which is another difference between linear and logistic regressions.
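The straight-line versus s-shaped-curve contrast can be sketched in base R, regenerating the hypothetical data from above so the chunk is self-contained:

```r
# Recreate the hypothetical data from the previous chunk
set.seed(123)
df <- data.frame(predictor = runif(100, 0, 10))
df$outcome <- rbinom(100, 1, 1 / (1 + exp(-1 * (0.5 * df$predictor - 2.5))))

# Binary outcomes cluster at 0 and 1
plot(df$predictor, df$outcome, xlab = "Predictor", ylab = "Outcome (0/1)")

# A straight line (linear regression) vs. the s-shaped logistic curve
abline(lm(outcome ~ predictor, data = df), lty = 2)
fit <- glm(outcome ~ predictor, data = df, family = binomial)
xs <- seq(0, 10, length.out = 200)
lines(xs, predict(fit, newdata = data.frame(predictor = xs), type = "response"))
```

The dashed line can fall below 0 or above 1, while the logistic curve stays within the 0–1 probability range.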

1.4 What does the logistic curve mean?

The logistic curve represents the probability of a positive outcome as a function of a given predictor. As the predictor changes and we move along the curve, the probability of the outcome rises from 0 to 100%. Since we are interested in classifying outcomes as success or failure, we set a threshold along the curve: outcomes above the threshold are classified as positive (1), and those below it as negative (0).
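A minimal sketch of this classification step, again using the hypothetical data from earlier (0.5 is a common threshold choice, not a requirement):

```r
# Fit on the hypothetical data, then classify with a 0.5 threshold
set.seed(123)
df <- data.frame(predictor = runif(100, 0, 10))
df$outcome <- rbinom(100, 1, 1 / (1 + exp(-1 * (0.5 * df$predictor - 2.5))))

fit <- glm(outcome ~ predictor, data = df, family = binomial)
prob <- predict(fit, type = "response")   # probabilities along the curve
pred <- ifelse(prob > 0.5, 1, 0)          # threshold at 50%

# Confusion table: predicted class vs. observed outcome
table(predicted = pred, observed = df$outcome)
```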

1.5 How are logistic regression models represented mathematically?

Before we can relate variables in our models, we transform the response variable to obtain a linear relationship between variables. This transformation is called the logit, and it consists of the log of the probability of success divided by the probability of failure, so that our response variable is expressed in log(odds). This is the equation for the logistic regression:

\[ \log\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + e \]

log(p/(1-p)): response variable (log-odds)
b0: y intercept
x1: predictor variable
b1: slope for explanatory variable 1
e: error

If we had more explanatory variables, we could keep adding them to the equation above as b2*x2 and so forth.
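To go from the equation back to a probability, we invert the logit. A small worked example with made-up coefficients (matching the ones used to simulate the hypothetical data above):

```r
# Hypothetical coefficients: b0 = -2.5 (intercept), b1 = 0.5 (slope)
b0 <- -2.5
b1 <- 0.5
x1 <- 7                          # a chosen value of the predictor

log_odds <- b0 + b1 * x1         # right-hand side of the equation: 1.0
p <- plogis(log_odds)            # inverse logit: p = exp(z) / (1 + exp(z))
p
## [1] 0.7310586
```

So at predictor value 7, the model predicts roughly a 73% probability of success.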

1.6 What kind of predictors can we have in a logistic regression?

Just like in a linear regression, we can use continuous and/or discrete variables to make predictions. Here are some examples:

Continuous variables

Discrete variables

  • Aquatic vegetation and hydroperiod to predict amphibian species richness, where aquatic vegetation has the following levels: (1) none or low; (2) medium–low; (3) medium–high and (4) high, and hydroperiod can be permanent (pond maintains water all year) or temporary (period of drought during the year) (Ribeiro et al., 2011)

1.7 Logistic regression model assumptions

  • Dependent variable has 2 levels or is a proportion of successes
  • Observations need to be independent
  • Normal distribution of data or residuals is not needed
  • Overdispersion should not be substantial (the residual deviance should not greatly exceed the residual degrees of freedom)
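One informal check for the last assumption compares residual deviance to residual degrees of freedom, sketched here on the hypothetical fit from earlier:

```r
# Refit the hypothetical model so this chunk is self-contained
set.seed(123)
df <- data.frame(predictor = runif(100, 0, 10))
df$outcome <- rbinom(100, 1, 1 / (1 + exp(-1 * (0.5 * df$predictor - 2.5))))
fit <- glm(outcome ~ predictor, data = df, family = binomial)

# A ratio near 1 suggests no strong overdispersion; much larger than 1
# suggests a problem. Note this check is informative for grouped/proportion
# data, but not for strictly binary 0/1 responses.
deviance(fit) / df.residual(fit)
```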

1.8 Steps to run a logistic model in R

1.8.1 Load dataset and view potentially interesting variables

# Set working directory and load the spreadsheet
setwd("C:/Users/user/Desktop/DFO_training")
library("readxl")  # for reading xlsx files
ham <- read_excel("ind dates flat all parameters Hamilton only FOR Kim.xlsx")

analysis <- ham
knitr::kable(analysis[c(25:30), c(2,8,26,47,98)], row.names = FALSE, digits = 3,
             align = rep('c', 5),
             col.names = c("Location", "Season", "Total phosphorus", "Filamentous diatom biomass", "Epilimnion temp."))
| Location | Season | Total phosphorus | Filamentous diatom biomass | Epilimnion temp. |
|:--------:|:------:|:----------------:|:--------------------------:|:----------------:|
|   deep   |   2    |      0.008       |           0.000            |      21.788      |
|   deep   |   2    |      0.008       |           3.100            |      21.826      |
|    NE    |   1    |      0.020       |             NA             |      19.891      |
|   deep   |   1    |      0.020       |             NA             |      19.706      |
|    NE    |   1    |        NA        |             NA             |      13.897      |
|   deep   |   1    |        NA        |          287.182           |      16.049      |

1.8.2 Remove NA data

# Drop rows with missing values in the two variables of interest
analysis <- analysis[!is.na(analysis$filamentous_Diatom), ]
analysis <- analysis[!is.na(analysis$mean_mixing_depth_temp), ]

1.8.3 Create a new column to describe presence or absence of filamentous diatoms

Ensure the reference level ("absent") is listed first: glm will model the probability of the second level ("present").

analysis$filam_presence <- ifelse(analysis$filamentous_Diatom > 0, "present", "absent")
analysis$filam_presence <- factor(analysis$filam_presence)
levels(analysis$filam_presence)
## [1] "absent"  "present"

1.8.4 Subset to analyse summer conditions and remove records in deep locations

analysis <- subset(analysis, (season == 2 | season == 3) & area_group != "deep")

1.8.5 Run model using glm function and family binomial

If the response is a proportion of successes, supply the weights argument with the total number of trials (successes + failures) for each observation.

model <- glm(filam_presence ~ mean_mixing_depth_temp,
             data = analysis, family = binomial)
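A hypothetical sketch of the proportion form, assuming grouped counts of successes out of trials per site (the data frame and column names are made up for illustration):

```r
# Hypothetical grouped data: presences out of 10 surveys at each site
sites <- data.frame(temp    = c(14, 16, 18, 20, 22),
                    present = c(9, 7, 5, 3, 1),     # successes
                    trials  = c(10, 10, 10, 10, 10))
sites$prop <- sites$present / sites$trials

# Response is a proportion; "weights" gives the number of trials
prop_model <- glm(prop ~ temp, data = sites,
                  family = binomial, weights = trials)
```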

1.8.6 Check model results

  • Odds: probability of success / probability of failure. Exponentiating a coefficient gives an odds ratio, the multiplicative change in odds per unit increase in the predictor.
summary(model)
## 
## Call:
## glm(formula = filam_presence ~ mean_mixing_depth_temp, family = binomial, 
##     data = analysis)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             10.1581     3.8402   2.645  0.00816 **
## mean_mixing_depth_temp  -0.4202     0.1736  -2.420  0.01551 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 79.807  on 69  degrees of freedom
## Residual deviance: 72.195  on 68  degrees of freedom
## AIC: 76.195
## 
## Number of Fisher Scoring iterations: 5
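To interpret the coefficients on the odds scale, we can exponentiate them, a standard back-transformation from log(odds):

```r
exp(coef(model))     # odds ratios
exp(confint(model))  # 95% confidence intervals for the odds ratios
```

Here the negative slope for mean_mixing_depth_temp corresponds to an odds ratio below 1: the odds of filamentous diatom presence decrease as temperature increases. (confint on a glm uses profile likelihood and may print a "Waiting for profiling to be done..." message.)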

1.8.7 Plot model predictions

  • Shaded areas are the confidence intervals around the predicted probabilities
library(sjPlot)
plot_model(model, type="pred", terms=c("mean_mixing_depth_temp[all]"))

1.9 R resources

1.10 References